1 Introduction

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris finibus leo et nulla maximus auctor. Nam at sapien porta, vulputate leo et, accumsan ex.

1.1 Methodology

This analysis follows the six steps of the data analysis process: ask, prepare, process, analyze, share, and act.


2 ASK Phase: Ask Questions to Make Data-Driven Decisions

The first step in the process is to ask the right questions in order to find the right solution to the problem. It includes defining business task and identify key stakeholders in the project.

2.1 Business Task

The goal of this project is to identify the behavioural differences between the Cyclistic members and the Casual riders among the stations. This project will focus on the trip data according to the hour, day, month and day of the week for both of the customer groups of the year 2022. The final deliverable will recommend 3 suggestions to convert Casual riders into Cyclistic members.

Our business task is to understand how annual members and casual riders use Cyclistic bikes differently. It also aims to extract insights and develop the most appropriate marketing strategies that appeal to casual riders and encourage them to subscribe to annual memberships.

2.2 Key Stakeholders

Three stakeholders include in the project:

  • The Cyclistic executive team who decides whether to approve the recommended marketing program.
  • The marketing director who is responsible for the development of campaigns and initiatives to promote the bike-share program.
  • The Cyclistic marketing analytics team who is responsible for collecting, analyzing, and reporting data that helps guide marketing strategies.

3 PREPARE Phase: Prepare Data for Exploration

The second step is to prepare and collect the data. It includes the following tasks:

  • Collecting and storing the data for further analysis.
  • Determining its integrity, credibility and accessibility.
  • Preparing the data to be ready for data cleaning.

3.1 Data Dictionary

Each historical bike trip dataset contains the following fields:

  • ride_id: unique ID number for all rides
  • rideable_type: type of bike (classic_bike, docked_bike, and electric_bike)
  • started_at: date and time the ride started
  • ended_at: date and time the ride ended
  • start_station_name: name of the station where the ride started
  • start_station_id: ID number of the station where the ride started
  • end_station_name: name of the station where the ride started
  • end_station_id: ID number of the station where the ride started
  • start_lat: latitude of the location where the ride started
  • start_lng: longitude of the location where the ride started
  • end_lat: latitude of the location where the ride ended
  • end_lng: longitude of the location where the ride ended
  • member_casual: type of user (casual and member)

3.2 Data Preparation

It is important to set up and import the data in the environment after downloading and storing CSV files in the desired location. RStudio and R programming will be utilized at this step.

3.2.1 Install and Load Libraries

First, it requires installing and loading the tidyverse library. It is used for the following tasks:

  • Importing and wrangling the data such as readr, tidyr, and dplyr.
  • Visualizing the data such as ggplot2.
# Load necessary library
library(tidyverse)

3.2.2 Import Data

After the environment has already been set up, the next step is to import all 12-month data and then merge them into a single data frame, called trip_data. The read_csv and bind_rows functions are used to read and combine each file.

# Import CSV files into 'trip_data' data frame
trip_data <- list.files(path = "./data/", 
                        pattern = "*-divvy-tripdata.csv", 
                        full.names = TRUE) %>% 
  lapply(read_csv) %>% 
  bind_rows %>% 
  arrange(started_at)

3.2.3 Preview Data

The data is ready to use for the next step. Before going to the next process, let’s preview the data frame using glimpse() and head() functions. From the below output, the data frame contains 5,667,717 rows and 13 columns. It also provide the following data type in each column.

  • Character (chr): ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, and member_casual.
  • Datetime (dttm): started_at and ended_at.
  • Double (dbl): start_lat, start_lng, end_lat, and end_lng.
# Inspect data frame using 'glimpse' function
glimpse(trip_data)
## Rows: 5,667,717
## Columns: 13
## $ ride_id            <chr> "98D355D9A9852BE9", "04706CA7F5BD25EE", "42178E850B…
## $ rideable_type      <chr> "classic_bike", "electric_bike", "electric_bike", "…
## $ started_at         <dttm> 2022-01-01 00:00:05, 2022-01-01 00:01:00, 2022-01-…
## $ ended_at           <dttm> 2022-01-01 00:01:48, 2022-01-01 00:04:39, 2022-01-…
## $ start_station_name <chr> "Michigan Ave & 8th St", "Broadway & Waveland Ave",…
## $ start_station_id   <chr> "623", "13325", "TA1305000009", "623", "623", "1343…
## $ end_station_name   <chr> "Michigan Ave & 8th St", "Broadway & Barry Ave", "C…
## $ end_station_id     <chr> "623", "13137", "TA1305000009", "623", "623", "1343…
## $ start_lat          <dbl> 41.87277, 41.94907, 41.87592, 41.87277, 41.87277, 4…
## $ start_lng          <dbl> -87.62398, -87.64863, -87.63119, -87.62398, -87.623…
## $ end_lat            <dbl> 41.87277, 41.93758, 41.87593, 41.87277, 41.87277, 4…
## $ end_lng            <dbl> -87.62398, -87.64410, -87.63058, -87.62398, -87.623…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…
# Inspect data frame using 'head' function
head(trip_data)

4 PROCESS Phase: Process Data from Dirty to Clean

# Create 'trip_data_v2' data frame in order to use it for cleaning data
trip_data_v2 <- trip_data

4.1 Remove Duplicate Observations

# Check duplicate values of 'ride_id' in 'trip_data_v2' data frame
sum(duplicated(trip_data_v2$ride_id))
## [1] 0

4.2 Fix Structural Errors

4.2.1 Bike Type: rideable_type

# Check unique type of bikes
bike_type <- count(trip_data_v2, rideable_type, name = "ride_count")
bike_type
# Change values from 'docked_bike' to 'classic_bike' and assign back to 'trip_data_v2' data frame
trip_data_v2 <- trip_data_v2 %>% 
  mutate(rideable_type = str_replace_all(rideable_type, "docked_bike", "classic_bike"))
# Recheck unique type of bikes
bike_type_v2 <- count(trip_data_v2, rideable_type, name = "ride_count")
bike_type_v2

4.2.2 Member Type: member_casual

# Check unique type of members
member_type <- count(trip_data_v2, member_casual, name = "ride_count")
member_type

4.2.3 Start and End Stations: start_station_name and end_station_name

# Recheck unique start stations and assign to 'start_station_v3' variable
start_station_v3 <- trip_data_v2 %>% 
  count(start_lat, start_lng, start_station_name, name = "station_count") %>% 
  arrange(start_lat, start_lng)

start_station_v3
# Recheck unique end stations and assign to 'end_station_v3' variable
end_station_v3 <- trip_data_v2 %>% 
  count(end_lat, end_lng, end_station_name, name = "station_count") %>% 
  arrange(end_lat, end_lng)
end_station_v3
# List all test stations and assign to 'test_station_list' variable
test_station_list <- c("Pawel Bialowas - Test- PBSC charging station", 
                       "Hastings WH 2", 
                       "DIVVY CASSETTE REPAIR MOBILE STATION", 
                       "Base - 2132 W Hubbard Warehouse", 
                       "Base - 2132 W Hubbard", 
                       "NewHastings", 
                       "WestChi", 
                       "WEST CHI-WATSON")

# Remove test stations and assign back to 'trip_data_v2' data frame
trip_data_v2 <- trip_data_v2 %>% 
  filter(!(trip_data_v2$start_station_name %in% test_station_list | 
           trip_data_v2$end_station_name %in% test_station_list))

# Check number of rows and columns
dim(trip_data_v2)
## [1] 5665349      13
# List all inconsistent words and assign to 'test_station_list' variable
words <- c("*", " - Charging", " (Temp)", "amp;", "Public Rack - ")
# " - midblock", 
# " - North", " - N", " N", "- north corner", 
# " - South", " - S", " S","-  south corner", " - midblock south", 
# " - East", " - E", " E", 
# " - West", " - W", " W", 
# " - NE", " - NW", " - SE", " - SW", 
# " NE", " NW", " SE", " SW"

# Repeat word in the 'words' list
for (word in words) {
  # Change specific texts to make them all aligned and assign back to 'trip_data_v2' data frame
  trip_data_v2 <- trip_data_v2 %>% 
    mutate(start_station_name = str_replace_all(start_station_name, fixed(word), "")) %>% 
    mutate(end_station_name   = str_replace_all(end_station_name,   fixed(word), ""))
}
# Recheck unique start stations and assign to 'start_station_v4' variable
start_station_v4 <- trip_data_v2 %>% 
  count(start_lat, start_lng, start_station_name, name = "station_count") %>% 
  arrange(start_lat, start_lng)

start_station_v4
# Recheck unique end stations and assign to 'end_station_v4' variable
end_station_v4 <- trip_data_v2 %>% 
  count(end_lat, end_lng, end_station_name, name = "station_count") %>% 
  arrange(end_lat, end_lng)

end_station_v4

4.3 Handle Missing Data

# Check missing values each column
na_col_count <- data.frame(colSums(is.na(trip_data_v2)))
colnames(na_col_count)[1] <- "na_count"
na_col_count
# Check unique start stations and assign to 'start_station' variable
start_station <- trip_data_v2 %>% 
  count(start_lat, start_lng, start_station_name, name = "station_count") %>% 
  arrange(start_lat, start_lng)

start_station
# Check unique end stations and assign to 'end_station' variable
end_station <- trip_data_v2 %>% 
  count(end_lat, end_lng, end_station_name, name = "station_count") %>% 
  arrange(end_lat, end_lng)

end_station
# Impute missing start/end station names and IDs by using geographic coordinate as reference
digit <- 5 # Set digits to start with 5

# Repeat rounding digits from 5 to 2
while (digit > 1) {
  # Add four columns to round digits
  trip_data_v2 <- trip_data_v2 %>% 
    mutate(start_lat_round = round(start_lat, digits = digit), 
           start_lng_round = round(start_lng, digits = digit), 
           end_lat_round   = round(end_lat,   digits = digit), 
           end_lng_round   = round(end_lng,   digits = digit))
  
  # Fill in missing start station names & IDs, referencing with 'start_lat_round' and 'start_lng_round'
  trip_data_v2 <- trip_data_v2 %>% 
    group_by(start_lat_round, start_lng_round) %>% 
    fill(start_station_name, .direction = "downup") %>% 
    fill(start_station_id,   .direction = "downup") %>% 
    ungroup()
  
  # Fill in missing end station names & IDs, referencing with 'end_lat_round' and 'end_lng_round'
  trip_data_v2 <- trip_data_v2 %>% 
    group_by(end_lat_round, end_lng_round) %>% 
    fill(end_station_name, .direction = "downup") %>% 
    fill(end_station_id,   .direction = "downup") %>% 
    ungroup()
  
  # Decrement digits by 1
  digit <- digit - 1
}
# Check missing values each column
na_col_count_v2 <- data.frame(colSums(is.na(trip_data_v2)))
colnames(na_col_count_v2)[1] <- "na_count"
na_col_count_v2
# Recheck unique start stations and assign to 'start_station_v2' variable
start_station_v2 <- trip_data_v2 %>% 
  count(start_lat, start_lng, start_station_name, name = "station_count") %>% 
  arrange(start_lat, start_lng)

start_station_v2
# Recheck unique end stations and assign to 'end_station_v2' variable
end_station_v2 <- trip_data_v2 %>% 
  count(end_lat, end_lng, end_station_name, name = "station_count") %>% 
  arrange(end_lat, end_lng)

end_station_v2
# Remove rows containing missing values and assign back to 'trip_data_v2' data frame
trip_data_v2 <- drop_na(trip_data_v2)

# Check number of rows and columns
dim(trip_data_v2)
## [1] 5614669      17

4.4 Add New Columns

# Add three columns into data frame: 'ride_length_min', 'day_of_week', and 'month'
trip_data_v2$ride_length_min <- as.double(difftime(trip_data_v2$ended_at, 
                                                   trip_data_v2$started_at, 
                                                   units = "mins"))
trip_data_v2$day_of_week <- wday(trip_data_v2$started_at, label = TRUE)
trip_data_v2$month <- format(trip_data_v2$started_at, "%b")
# Order levels of 'day_of_week' from Monday to Sunday
trip_data_v2$day_of_week <- ordered(trip_data_v2$day_of_week, 
                                    levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# Order levels of 'month' from January to December
trip_data_v2$month <- ordered(trip_data_v2$month, 
                              levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", 
                                         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
# Inspect data frame using 'head' function
head(trip_data_v2)

4.5 Remove Unwated Data

4.5.1 Irrelevant Columns

# Drop irrelevant columns and assign back to 'trip_data_v2' data frame
trip_data_v2 <- trip_data_v2 %>% 
  select(!c(start_station_id, end_station_id, 
            start_lat_round, start_lng_round, 
            end_lat_round, end_lng_round))
# Check number of rows and columns
dim(trip_data_v2)
## [1] 5614669      14
# Inspect data frame using 'head' function
head(trip_data_v2)

4.5.2 Error Inputs

trip_data_v2 <- trip_data_v2 %>% 
  filter( !(trip_data_v2$start_lat == 0 | 
            trip_data_v2$start_lng == 0 | 
            trip_data_v2$end_lat   == 0 | 
            trip_data_v2$end_lng   == 0) )
# Check number of rows and columns
dim(trip_data_v2)
## [1] 5614661      14
summary(trip_data_v2$ride_length_min)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -10353.35      5.80     10.25     16.29     18.38  32035.45
trip_data_v2 <- trip_data_v2 %>% 
  filter(!(trip_data_v2$ride_length_min < 1 | trip_data_v2$ride_length_min > 1440))
# Check number of rows and columns
dim(trip_data_v2)
## [1] 5494500      14

4.5.3 Potential Outliers

summary(trip_data_v2$ride_length_min)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.05   10.48   16.55   18.67 1439.37
ggplot(data = trip_data_v2, aes(x = member_casual, y = ride_length_min, fill = member_casual)) + 
  geom_boxplot() +
  coord_flip() + 
  theme(legend.position="none") + 
  labs(x = "Member type", 
       y = "Ride length (in minutes)", 
       title = "Box plot showing 'ride_length_min' before removing outliers")

median_value <- median(trip_data_v2$ride_length_min)
q1 <- as.numeric(quantile(trip_data_v2$ride_length_min, probs = 0.25))
q3 <- as.numeric(quantile(trip_data_v2$ride_length_min, probs = 0.75))

iqr_value <- IQR(trip_data_v2$ride_length_min)

lower_limit <- q1 - ( 1.5 * iqr_value )
upper_limit <- q3 + ( 1.5 * iqr_value )
trip_data_v2 <- trip_data_v2 %>% 
  filter(!(trip_data_v2$ride_length_min < lower_limit | trip_data_v2$ride_length_min > upper_limit))
summary(trip_data_v2$ride_length_min)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.783   9.717  11.958  16.150  37.583
ggplot(data = trip_data_v2, aes(x = member_casual, y = ride_length_min, fill = member_casual)) + 
  geom_boxplot() +
  coord_flip() + 
  theme(legend.position="none") + 
  labs(x = "Member type", 
       y = "Ride length (in minutes)", 
       title = "Box plot showing 'ride_length_min' after removing outliers")

4.6 Data Validation

4.6.1 Data Preview

# Inspect data frame using 'glimpse' function
glimpse(trip_data_v2)
## Rows: 5,091,142
## Columns: 14
## $ ride_id            <chr> "98D355D9A9852BE9", "04706CA7F5BD25EE", "42178E850B…
## $ rideable_type      <chr> "classic_bike", "electric_bike", "electric_bike", "…
## $ started_at         <dttm> 2022-01-01 00:00:05, 2022-01-01 00:01:00, 2022-01-…
## $ ended_at           <dttm> 2022-01-01 00:01:48, 2022-01-01 00:04:39, 2022-01-…
## $ start_station_name <chr> "Michigan Ave & 8th St", "Broadway & Waveland Ave",…
## $ end_station_name   <chr> "Michigan Ave & 8th St", "Broadway & Barry Ave", "C…
## $ start_lat          <dbl> 41.87277, 41.94907, 41.87592, 41.87277, 41.87277, 4…
## $ start_lng          <dbl> -87.62398, -87.64863, -87.63119, -87.62398, -87.623…
## $ end_lat            <dbl> 41.87277, 41.93758, 41.87593, 41.87277, 41.87277, 4…
## $ end_lng            <dbl> -87.62398, -87.64410, -87.63058, -87.62398, -87.623…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…
## $ ride_length_min    <dbl> 1.716667, 3.650000, 30.966667, 28.883333, 28.483333…
## $ day_of_week        <ord> Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, S…
## $ month              <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, J…

4.6.2 Empirical Rule

# 
summary_stats <- summarise(trip_data_v2,
                           sd = sd(ride_length_min),
                           mean = mean(ride_length_min),
                           count = n())

# 
calculate_percentage <- function(n_sd) {
  filtered_count <- trip_data_v2 %>%
    filter(between(ride_length_min, summary_stats$mean - n_sd * summary_stats$sd, summary_stats$mean + n_sd * summary_stats$sd)) %>%
    summarise(count = n())

  round((filtered_count$count / summary_stats$count) * 100, 2)
}

# 
percentage_sd1 <- calculate_percentage(1)
percentage_sd2 <- calculate_percentage(2)
percentage_sd3 <- calculate_percentage(3)
paste0("One standard deviation: ", format(percentage_sd1, nsmall = 2), "%")
## [1] "One standard deviation: 72.10%"
paste0("Two standard deviations: ", percentage_sd2, "%")
## [1] "Two standard deviations: 94.08%"
paste0("Three standard deviations: ", percentage_sd3, "%")
## [1] "Three standard deviations: 99.38%"

5 ANALYZE Phase: Analyze Data to Answer Questions


6 SHARE Phase: Share Data Through the Art of Visualization


7 ACT Phase: Act Based on Data and Make Recommendations